Abstract

Sixty-seven participants (39 men and 28 women), ranging in age from 26 to 79 years, were 
administered Raven's Advanced Progressive Matrices (APM) on three occasions. Although total APM 
scores were found to be highly reliable across the three occasions, the reliabilities of most individual items 
were extremely low. A single-factor model remained a borderline adequate fit (explaining approximately 
20% of the variance) for the interitem correlation matrix on all three occasions. Total APM scores 
increased significantly across the three occasions (approximately two items per occasion). Improvements 
in total score across the occasions happened within a context of subjects changing both correct and 
incorrect responses from the previous occasion. The number of items left unanswered was found to be 
unrelated to both APM score on any given occasion and the amount of gain in score made across occasions. 
These findings suggest that the improvements in performance were not based on the acquisition of a 
strategy designed to respond to more items or on the retention of item-specific information, but rather, the 
improvement reflected learning something common to the types of items found in the APM. 
© 2003 Elsevier Science Inc. All rights reserved. 
1. Introduction

Raven's Advanced Progressive Matrices test (APM, 1962 revision, Raven, 1962) has 
enjoyed considerable service in both applied and research settings. It has been deemed to be a 
measure of a range of related concepts such as general intelligence, fluid intelligence, abstract 
reasoning, and inductive reasoning (Court, 1994). Recent experimental approaches to 
intelligence, broadly understood, have typically employed the APM as a measure of ability 
with which individual performance, usually latency, on elementary cognitive tasks is correlated 
and then theoretically interpreted. Thus, the APM plays a prominent role in our endeavors to 
understand the nature of intellectual abilities. 

The key to any theoretical interpretation of such correlations is having an understanding of 
both response latency on simple cognitive tasks and the APM. Any theoretical interpretation, 
such as the current speed-of-processing accounts (Jensen, 1982, 1987; Neubauer, 1995; 
Vernon, 1987), presupposes specific assumptions concerning the nature of the individual 
tasks. The natures of such tasks, even in the case of choice reaction time (Carlson & 
Widaman, 1987; Detterman, 1987; Dittrich & Henderson, 1999; Henderson & Dittrich, 1998; 
Juhel, 1991), however, are not givens. Thus, furthering our understanding of the APM, what 
it is measuring, and what influences performance on it is of vital importance. Unfortunately, 
there has been little exploration of the behavior of the APM, with respect to either total score 
or individual items, save for factor analytic studies. Exceptions include explorations such as 
the studies of Carpenter, Just, and Shell (1990), DeShon, Chan, and Weissbein (1995), and 
Verguts, De Boeck, and Maris (2000). 

The present paper, a detailed description of the effects of repeated test administrations, is 
first of all an attempt to understand what is acquired or learned through practice on the APM. 
Secondly, such an exploration will hopefully shed some light on the nature of the APM and 
individual differences in performance. Previous studies of the APM, as well as studies of the 
Standard Progressive Matrices (SPM; Raven, 1938, 1956), have reported improvements in 
scores across testing occasions (typically a two- to five-item increase). Although, psycho- 
metrically speaking, practice effects are undesirable, the main concern has been one of 
reliability. There are, however, other issues that accompany practice effects, such as the 
possible effects of memory. Is the test reliable because subjects are able to remember the 
responses they made on the previous occasion? Although the answer to such questions would 
be informative, we are unaware of any study that has systematically scrutinized practice 
effects. This would include examining possible alterations in the psychometric properties of 
the test as a whole, the items individually, and the identification of possible covariates of 
improvement. 

It appears that authors have merely noted that the APM shows, as do most tests of 
cognitive ability, a practice effect, and leave it at that or argue that the important issue is 
reliability (Raven, Court, & Raven, 1988). In some studies, the practice effect has been used 
as a baseline for exploring the possible effects of various training strategies (Diemand, 
Schuler, & Stapf, 1991). In other studies, it has been used as a baseline for examining the 
effects of certain experiences, such as that of classical music (Newman et al., 1995). Practice 
effects have also been incidentally reported, such as in Bors and Forrin's (1995) study of the 
relations among age, mental speed, and IQ. All of these authors have likely assumed that the 
observed practice effect is simply the result of improvement either in general or in specific 
test-taking skills and strategies or that it is related to increased familiarity with the types of 
items on the APM. These are fair assumptions. There may be other reasons, however, for the 
improvement, at least among some of the subjects. Even if general and specific test-taking 
strategies are responsible for the bulk of the practice effect, examination of individual 
differences in the effect may reveal something about individual differences in overall 
performance on the APM. 

Little is known about the APM practice effect itself, save for the fact that the mean size of 
the effect appears to be independent of age and gender (Bors & Forrin, 1995). There are 
comparable findings with respect to the SPM (Anderson et al., 1986; Denney & Heidrich, 
1990). In the Denney and Heidrich (1990) study, however, there was a nonsignificant 
trend toward larger practice effects with age (young = 0.4; middle aged = 1.2; old = 2.9). 
Possible trends in other studies are not mentioned and we have been unable to locate a good 
set of descriptive statistics concerning the phenomenon. We do not know what proportion of 
subjects show the effect. 

This paper represents a beginning. It is a detailed description of the practice effect and an 
exploration of several questions, focusing on what is learned. Is the practice effect due to 
subjects making fewer errors on items that are less difficult than the highest level of their 
previous performance or is it that they are able to perform at higher levels of difficulty? It may 
well be that both are the case. Are there stable individual differences with respect to the nature 
of the practice effect? Is there any change in the test's factor structure with practice? Most 
studies of the factor structure of the APM conclude that a single-factor model best fits the 
interitem correlation matrix (Alderton & Larson, 1990; Arthur & Woehr, 1993; Bors & 
Stokes, 1998; DeShon et al., 1995; Paul, 1985). Does the fit of this model improve or 
deteriorate with repeated testing? If there is a single predominant factor underlying 
performance, then, as incidental noise is reduced with practice, the fit of the single-factor 
model should improve. In addition to these questions concerning the test taken as a whole, 
there are important questions about the behavior of individual items. Such a focus is required, 
given that most of the cognitive analyses of abilities measured by the APM focus on 
individual items and differences in performance on items as the level of analysis. Unfortu- 
nately, it would be expected that individual items would be less reliable than the test as a 
whole. However, are the items so unreliable as to make them questionable as units of 
analysis? 



2. Method 

2.1. Participants 

Data from most of the participants in the present study have been reported in Bors and 
Forrin's (1995) study of age, speed of information processing, and fluid intelligence. With 
respect to the APM, only the test-retest correlations and the significant practice effect were 
reported, along with correlations between the APM and speed of processing measures. In the 
present paper, data from 67 subjects are reported: 39 men and 28 women. The ages of the 
participants ranged from 26 to 79 (M= 46.05, S.D. = 12.22). The mean age of the men was 
45.77 (S.D. = 12.17) and that of the women was 46.46 (S.D. = 12.50). All subjects were in 



294 D.A. Bors, F. Vigneau / Learning and Individual Differences 13 (2003) 291-312 

good health and had excellent health histories. Details of medical examinations can be found 
in Bors and Forrin's study. 

2.2. Procedure 

All participants were administered the APM on three separate occasions, with intervals of 
approximately 45 days between testing occasions. Both the 12-item practice set (Set I) and 
the 36-item test set (Set II) were administered on all three occasions. Standard instruction and 
timing (5 min for Set I and 40 min for Set II) were followed on all occasions. Participants 
were tested in small groups (four to six). The present study focuses solely on the performance 
on Set II. 



3. Results 

As can be seen in Table 1, scores on the APM increased over the three testing occasions, 
F(2,132) = 33.67, MSE = 6.02, P < .001. The mean score on the second occasion (M = 20.45, 
S.D. = 6.46) was greater than that on the first occasion (M = 18.51, S.D. = 6.34) and the mean 
score on the third occasion (M = 21.85, S.D. = 7.04) was greater than that of the second. The 
fact that the linear trend across the three testing occasions was significant, F(1,66) = 62.16, 
MSE = 6.02, P < .001, and that the quadratic trend was not, F < 1.0, indicates that the general 
practice effect was not diminishing. Said differently, there was no difference between the 
mean gain scores from the first to the second occasion (M= 1.94, S.D. = 3.56) and the mean 
gain scores from the second to the third occasion (M= 1.40, S.D. = 2.99). Additionally, given 
that none of the participants reached the maximum score and that the variability in scores 
across occasions did not diminish, none of the effects reported below can be viewed as 
artifacts of a ceiling effect. A detailed examination of the gain scores will follow later. 

Table 1

Descriptive statistics of APM score, number of items unanswered and number of items erroneously answered by 
occasion

                                        Minimum   Maximum     Mean     S.D.   Skewness   Kurtosis
Score
  Occasion 1                               4.00     32.00    18.51     6.34     -0.276     -0.118
  Occasion 2                               3.00     35.00    20.45     6.46     -0.278      0.119
  Occasion 3                               2.00     35.00    21.85     7.04     -0.670      0.227
Number of items left unanswered
  Occasion 1                               0.00     24.00     6.76     5.29      0.988      1.535
  Occasion 2                               0.00     21.00     5.49     4.49      0.778      0.770
  Occasion 3                               0.00     20.00     3.87     4.32      1.387      2.363
Number of items erroneously answered
  Occasion 1                               1.00     31.00    10.73     7.68      0.812     -0.002
  Occasion 2                               0.00     28.00    10.06     7.03      0.909      0.018
  Occasion 3                               1.00     34.00    10.28     8.01      1.236      0.750
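
The trend interpretation above can be checked informally against the reported means, using the usual orthogonal polynomial contrast weights for three equally spaced occasions. The sketch below is illustrative only: the helper name is hypothetical, and it works on the group means from Table 1 rather than on the raw data, so it shows the size and direction of the trends but cannot reproduce the F tests.

```python
# Orthogonal polynomial contrast weights for three equally spaced occasions.
LINEAR = (-1, 0, 1)
QUADRATIC = (1, -2, 1)

def contrast(means, weights):
    """Weighted sum of occasion means (hypothetical helper for illustration)."""
    return sum(w * m for w, m in zip(weights, means))

# Mean APM scores on Occasions 1 to 3, as reported in Table 1.
apm_means = (18.51, 20.45, 21.85)

linear_trend = contrast(apm_means, LINEAR)        # Occasion 3 mean minus Occasion 1 mean
quadratic_trend = contrast(apm_means, QUADRATIC)  # departure from a straight line

# Successive mean gains implied by the same means.
gain_1_to_2 = apm_means[1] - apm_means[0]
gain_2_to_3 = apm_means[2] - apm_means[1]
```

On these means the linear contrast is about 3.34 items and the quadratic contrast about -0.54 items, which is in line with a significant linear trend and a nonsignificant quadratic trend.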

A breakdown of the complement of the performance score, the number of items incorrect, into its 
constituents, the number of problems left unanswered (blank) and the number of wrong 
answers (erroneous), shows that on average the practice effect is paralleled by a reduction in 
the number of items left unanswered, F(2,132) = 15.30, MSE = 9.23, P < .001 (see Table 1). 
As was the case for the increase in mean scores across occasions, the linear trend in the 
reduction of items left unanswered was significant, F(1,66) = 23.73, MSE = 11.84, P < .001, 
but the quadratic trend was not, F < 1.0. By comparison, there were no differences among the 
mean number of erroneous answers across the three occasions (F<1.0): Occasion 1 
(M= 10.73, S.D. = 7.68), Occasion 2 (M= 10.06, S.D. = 7.03), Occasion 3 (M= 10.28, 
S.D. = 8.01). 

It should be kept in mind, however, that although this breakdown of incorrect items into 
items erroneously answered and items left unanswered does produce sets of items that are 
mutually exclusive within an occasion for a given subject, it does not necessarily produce sets 
that are mutually exclusive across occasions for a given subject. It might be the case that 
some of the unanswered items on the first occasion become items answered erroneously on a 
subsequent occasion whereas some of the problems erroneously answered on the first 
occasion are correctly solved on a subsequent occasion. Thus, the total number of erroneous 
responses may remain stable, but the actual items involved are changing. 

Although the mean APM scores changed significantly, the rank ordering of the scores was 
found to be stable across the three occasions. The correlation between the scores on Occasion 
1 and Occasion 2 was .85, between Occasion 2 and Occasion 3 was .91, and between 
Occasion 1 and Occasion 3 was .87. 

The Cronbach's alphas (Occasion 1=.88, Occasion 2=.89, Occasion 3=.90) and the 
Spearman-Brown split-half reliabilities (Occasion 1=.90, Occasion 2=.91, Occasion 3=.93) 
for the APM responses on the three occasions illustrated that the APM was internally 
consistent on all occasions. (The correlation matrices between the items for the three testing 
occasions can be found in Appendices A-C.) The alphas also suggested that, consistent with 
the bulk of the literature in the area, a single-factor model best accounts for the interitem 
correlations on the test. Using confirmatory factor analysis, a simple single-factor model was 
applied to the data from the three occasions. Although the χ² goodness-of-fit indices were 
statistically significantly different from each other, the differences are small and the three 
RMSEAs reported in Table 2 indicated that the single-factor model remained a borderline 
adequate fit across the three occasions. This was consistent with previous findings. 
Additionally, the correlations among the parameter estimates (loadings of the variables on 
the factors) from the three analyses (Occasions 1 and 2=.62, Occasions 2 and 3=.42, 
Occasions 1 and 3=.78) indicated a moderate, although not a strong, similarity in internal 
structure across the occasions. 

Table 2

Goodness-of-fit indices for the confirmatory factor analysis of the single-factor model

                  χ²      df     RMSEA     Lower     Upper
Occasion 1       898     594     .0666     .0522     .0798
Occasion 2       938     594     .0838     .0716     .0956
Occasion 3       842     594     .0734     .0600     .0886
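
For readers who wish to compute comparable internal-consistency indices on their own item data, the following is a minimal sketch of the standard formulas for Cronbach's alpha and the Spearman-Brown correction. The function names and the toy response matrix are invented for illustration; nothing here is taken from the study's data or analysis software.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a participants-by-items matrix of 0/1 item scores."""
    k = len(responses[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*responses)]  # variance of each item
    total_var = variance([sum(row) for row in responses])   # variance of total scores
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def spearman_brown(half_r):
    """Step a correlation between two test halves up to full test length."""
    return 2 * half_r / (1 + half_r)

# Toy data (invented): 4 participants by 3 items, 1 = correct, 0 = incorrect.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy)
```

For the toy matrix, the item variances sum to half of the total-score variance, so alpha works out to 0.75. How the test is split into halves for the Spearman-Brown index (for example, odd versus even items) is left to the user, since the split is not specified here.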

When viewed from the level of group means (aggregate) across occasions it appeared 
that it was the number of unanswered items that was prognostic with respect to 
performance. This may suggest that subjects are acquiring a strategy of answering more 
questions. Thus, it may be that the practice effect was not the result of learning something 
about solving the items, but was merely an artifact of attempting more items. When 
examined within occasions, the picture began to change. Within occasions, the number of 
unanswered items was only weakly and nonsignificantly correlated with APM score 
(Occasion 1 = -.14, Occasion 2 = -.20, Occasion 3 = -.07). That is, a participant's score 
on any given occasion was found to be unrelated to the number of items he or she left 
unattempted. 

3.1. Influence of gender and age 

As depicted in Table 3, there were no main effects for gender on total score or the number 
of items left unanswered on any of the occasions nor were there any interactions between 
gender and occasion with respect to any of these measures. Although age was weakly but 
significantly correlated with total APM score (r = -.25, -.24, -.31, all Ps < .05, for 
Occasions 1, 2 and 3, respectively), age was not a predictor of the number of items left 
unanswered. 

3.2. The practice effect at the item level 

3.2.1. Characteristics of items inferred from aggregate results 

Despite the practice effects, as a whole, the APM appears quite stable, both across 
occasions and internally within occasions. Stability can also be found at the item level. 

Table 3

Means and standard deviations of score, number of items left unanswered, and number of items erroneously 
answered, by occasion and gender

                                      Male (n = 39)        Female (n = 28)
                                      Mean      S.D.       Mean      S.D.
Score
  Occasion 1                         18.00      7.13      19.21      5.07
  Occasion 2                         20.03      7.15      21.04      5.43
  Occasion 3                         21.46      8.06      22.39      5.40
Number of items left unanswered
  Occasion 1                          6.92      5.27       6.54      5.39
  Occasion 2                          5.34      4.09       5.68      5.07
  Occasion 3                          3.64      3.91       4.18      4.90
The difficulty levels (the proportions of a sample of participants who incorrectly answer 
the item) of APM items have been consistently found to increase monotonically from the 
first to the last (Forbes, 1964 [see also Raven et al., 1988]; Carpenter et al., 1990; Paul, 
1985). The rank ordering of difficulties in the present study was in close agreement with 
that presented in the APM manual (Pearson's rs: .97, .98, and .98 for Occasions 1, 2 and 
3, respectively; Kendall's τs: .82, .88, and .85 for Occasions 1, 2 and 3, respectively). It 
was also highly stable from one occasion to another (all Pearson's rs at .98 or above; 
Kendall's τs of .85 [Occasions 1 and 2], .91 [Occasions 2 and 3], and .84 [Occasions 1 
and 3]). 

The stability at the item level was also found at a qualitative level. As reported in the APM 
manual (Raven et al., 1988; see Forbes, 1964), for each item there is a most common error, 
one whose frequency is greater than the others. We found that the most common error was 
consistent across all three occasions. The most common error on the first occasion was the 
most or second most common error for 32 of the 36 items on subsequent occasions. For two 
of the items where the most common error was not consistent, the frequencies for any given 
erroneous response were all very low. 

3.2.2. Items at the level of individual differences 

As would be expected, the picture of stability at the item level changed somewhat when an 
individual differences analysis was carried out. From the perspective of individual differ- 
ences, the behavior of items was considerably less consistent than was the case with the test 
as a whole. The within-occasion average interitem correlations (Occasion 1=.16, Occasion 
2=.18, Occasion 3=.21) illustrate that a participant's performance on any given item was not 
indicative of his or her performance on any other given item. 
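
Low average interitem correlations and high alphas are not contradictory when a test has many items. As a rough check, the sketch below applies the textbook formula for standardized alpha, which depends only on the number of items and the mean interitem correlation. This is an assumption-laden approximation (it treats all items as having equal variance, and the function name is hypothetical), so it need not match the reported alphas exactly.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the number of items and the mean interitem correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)

# 36 items in Set II; mean interitem correlations as reported for Occasions 1 to 3.
approx_alphas = [round(standardized_alpha(36, r), 2) for r in (0.16, 0.18, 0.21)]
```

With 36 items, mean interitem correlations of .16 to .21 already imply alphas of roughly .87 to .91, close to the values obtained on the three occasions.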

The picture was also somewhat surprising when we examined the items across 
occasions. Performance on each of the 36 items was examined separately across the 
three occasions. The test-retest correlations for the items across Occasions 1 and 2 
ranged from -.08 to .58 (M=.34), across Occasions 2 and 3 they ranged from -.03 to 
.90 (M=.40), and across Occasions 1 and 3 the correlations ranged from -.04 to .70 
(M=.36). Even when the practice effect was taken into consideration, the test-retest 
correlations for the 36 items appeared surprisingly low, with the exceptions of a few items 
that were either extremely difficult or extremely easy. The average practice effect was 
approximately two items across each occasion. Thus, to account for such low reliabilities, 
there must have been an additional number of items switching between being correct and 
incorrect. 
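
An item-level test-retest correlation of the kind described above can be sketched as a Pearson correlation between the 0/1 scores on the same item at two occasions (for dichotomous scores this is the phi coefficient). The function names and the toy matrices below are hypothetical and serve only to show the computation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; for 0/1 item scores this equals the phi coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when every participant has the same score
    return sxy / sqrt(sxx * syy)

def item_retest_correlations(first, second):
    """One test-retest correlation per item, given two participants-by-items matrices."""
    return [pearson(a, b) for a, b in zip(zip(*first), zip(*second))]

# Toy data (invented): 4 participants by 2 items on two occasions.
first = [[1, 1], [1, 0], [0, 1], [0, 0]]
second = [[1, 1], [1, 1], [0, 0], [0, 0]]
retest = item_retest_correlations(first, second)
```

In the toy data the first item is answered identically on both occasions (correlation of 1), whereas the second item is passed by two participants each time but not by the same two, giving a correlation of 0 despite an unchanged pass rate.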

3.2.3. Analysis of gains 

As mentioned above, the average gain across Occasions 2 and 3 (1.40 items) was 
comparable to the average gain across Occasions 1 and 2 (1.94 items). What do these gains 
represent, in terms of a participant's performance on the specific items across occasions? For 
example, a gain of three items made by a participant may be the result of solving correctly 
five items that were incorrectly answered previously (positive switches) and answering 
incorrectly two items that were correctly solved previously (negative switches). We can refer 
to the total number of positive and negative switches as the fluctuation. As might be 
anticipated from the very weak test-retest reliabilities, the above example is not far removed 
from the typical scenario. For example, across Occasions 1 and 2, the positive switches 
ranged from 0 to 12 (M=4.93, S.D. = 2.88) and the negative switches ranged from 0 to 9 
(M=2.99, S.D. = 2.04). Evidently, although the change in total APM score may be small or 
even nonexistent, which items are answered correctly and which items are answered 
incorrectly across the two occasions may fluctuate greatly. Given the average scores 
(Occasion 1 = 18.51, Occasion 2 = 20.45), the extent of the fluctuation was considerable 
(M=7.91 items, S.D. = 3.50). The pictures across Occasions 2 and 3 and across Occasions 1 
and 3 were similar. 
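
The decomposition of a gain into positive and negative switches can be made concrete with a short sketch. The function name and the ten-item response vectors are invented for illustration, and unanswered items are assumed to be scored 0 along with wrong answers.

```python
def switches(earlier, later):
    """Count positive and negative switches for one participant across two occasions.

    earlier, later: sequences of 0/1 item scores (1 = correct; 0 = wrong or unanswered).
    """
    positive = sum(1 for a, b in zip(earlier, later) if a == 0 and b == 1)
    negative = sum(1 for a, b in zip(earlier, later) if a == 1 and b == 0)
    return positive, negative

# Invented responses reproducing the worked example in the text:
# five positive switches and two negative switches.
occasion_1 = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
occasion_2 = [1, 0, 1, 1, 1, 1, 1, 0, 1, 1]

positive, negative = switches(occasion_1, occasion_2)
gain = positive - negative         # net change in total score
fluctuation = positive + negative  # total number of items that changed status
```

Here a net gain of three items conceals seven items that changed status, which is the same pattern as in the group means reported above (4.93 positive and 2.99 negative switches for a mean gain of 1.94).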

To this point, we have deconstructed participants' performance on the APM across 
occasions into positive and negative switches. In a somewhat different manner, we might 
pose our question, what do gains across occasions represent? For example, a gain of three 
items made by a participant may be the result of solving three items beyond the last correctly 
answered item on the previous occasion (extension) or it may simply be that the participant 
now is correctly answering items that were answered incorrectly before the last correctly 
solved item on the previous occasion (filling). Stated simply, are subjects answering more 
difficult items or are they making fewer errors on easier items? As seen in Table 4, gain was 
based on both filling and extension. The mean differences between extension and filling in all 
cases were nonsignificant. Thus, both extension and filling contributed to improvement in 
overall performance. 

Finally, we may ask if the change in the number of items left unanswered by a participant 
is predictive of his or her gain. Even though, as reported above, the number of items left 
unanswered was not predictive of performance within an occasion, the change in the number 
of items left unanswered may be so. If so, it would be indicative of the importance of simply 
attempting more items. However, across the three occasions, the correlation between the 
change in the number of items left unanswered and gain was .01. That is, the number of 
additional items attempted on a subsequent occasion was unrelated to the amount of 
improvement in the participant's total APM score. 

Table 4

Extension and filling: descriptive statistics

                                  Minimum   Maximum     Mean     S.D.
Across Occasions 1 and 2
  Number of items extension            -8         8     0.75     2.57
  Number of items filling              -3         8     1.19     2.62
Across Occasions 2 and 3
  Number of items extension            -2         7     0.99     1.87
  Number of items filling              -8         6     0.42     2.68
Across Occasions 1 and 3
  Number of items extension            -2         7     1.78     2.06
  Number of items filling              -6        12     1.57     3.18

4. Discussion 

Similar to what other studies have found (Diemand et al., 1991), the APM performance of 
almost all participants in the present research improved with repeated testing. An increase in 
the average APM score was paralleled by a reduction in the average number of items left 
unanswered. Improvement in score, however, was highly variable across individuals, with 
some participants gaining up to 10 points, and others performing more poorly on the second or 
third testing occasion than they had on the first, a finding reported previously with the 
standard version of the matrices (Ostijn, 1970). 

At the level of individual differences, and in contrast with the picture at the aggregate level, 
we observed that neither total score nor amount of improvement was related to the number of 
items left unanswered. This constitutes a clear illustration of the importance of making a 
distinction between the aggregate and the individual differences levels when trying to identify 
determinants of cognitive performance. 
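
The distinction between the two levels can be illustrated with a deliberately artificial example. In the sketch below, the change scores for four hypothetical participants are invented so that, on average, scores rise and unanswered items fall, while the two changes are uncorrelated across individuals. None of the numbers come from the study.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented change scores for four hypothetical participants.
score_gain = [1, 3, 1, 3]            # items gained from one occasion to the next
unanswered_reduction = [1, 1, 3, 3]  # fewer items left blank on the later occasion

# Aggregate level: both means move together (scores up, blanks down).
mean_gain = mean(score_gain)
mean_reduction = mean(unanswered_reduction)

# Individual-differences level: who gains most is unrelated to who attempts more.
r_individual = pearson(score_gain, unanswered_reduction)
```

Both means equal 2, yet the correlation between the two change scores is 0, so parallel movement of group means says nothing by itself about covariation among individuals.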

Consistent with the variability in gain, we observed that improvement was not simply 
the result of answering a few slightly more difficult items on a subsequent occasion 
(extension), neither was it simply the result of correcting errors made on relatively easy 
items (filling). The practice effect emerged out of considerable fluctuation in the 
participants' responses, where participants were answering correctly some items that were 
previously answered incorrectly and answering incorrectly some items that were previously 
answered correctly. This was reflected in the large number of positive and negative 
switches and resulted in lower than expected reliabilities for the items. Given this high 
degree of switching, improvement in performance does not appear to be related to memory 
of specific items. This switching, along with the fact that the change in the number of 
items left unanswered was found to be unrelated to improvement in scores (r = .01), 
suggests that there was learning with respect to how to solve such reasoning problems, not 
specific problems. 

In the midst of this flux, we saw stability at two levels: at the level of the individual 
differences in total score, and at the level of the behavior of the test itself. Over the three 
testing occasions, the rank ordering of the participants' scores was highly stable (all test- 
retest reliabilities were at .85 or above). Only a few participants markedly changed, one 
way or another, their relative standing in the group. The test-retest reliability of the APM is 
not questioned here. On the contrary, with a between-administration interval of 6 weeks, the 
test-retest reliability was as great as one would expect or desire from much briefer 
intervals. 

With respect to the behavior of the test itself, the three strong alphas registered a continued 
high internal consistency. The rank ordering of the items in terms of their difficulties was 
highly consistent across testing occasions. These rank orderings were also very similar to the 
one obtained with the normative sample (Forbes, 1964) and presented in the APM manual 
(Raven et al., 1988). However, it should be said that items left unanswered, particularly in the 
last third of the test, surely contributed to the stability of the "difficulties" across occasions. 
Different results might have been obtained if all participants had completed all the items. 
Because a large proportion of the participants leave the last items unanswered, the "real" 
difficulty level of these items cannot be determined, a fact Yates (1966) mentioned several 
decades ago. 

What do the present findings tell us about the nature of the APM? At one level, this is a 
question related to the test's factor structure. As mentioned in the introduction, the APM 
usually is thought to be dominated by a single factor. This does not necessarily mean, 
however, that one predominant factor influences performance on any given item or on the 
APM as a whole. The items might share, to varying degrees, several factors. That is, 
performance on all of the items might be influenced by a number of determinants. The more 
they share these determinants, the more the variance can be attributed to a single factor. To 
the extent that not all the items are influenced by all of the determinants, or to the extent that 
the items share the determinants to varying degrees, the less stable and the less powerful will 
be a single-factor model. 

The acceptance of the single-factor model has most likely been the result of assuming that 
the presence of more than one factor would mean that there will be more than one cluster of 
items, and there have been such reports concerning the APM (Dillon, Pohlmann, & Lohman, 
1981). Failure to find more than one cluster of items does not mean that there is only one 
factor, however. It may simply mean that the items are not factor specific. Failure to find more 
than a single factor may tell us more about the limitations of factor analysis than about the 
nature of the APM. 

As mentioned in the introduction, although most authors have favored the single-factor 
model, it has been more of an acceptance than a conviction. When the figure is reported, the 
single-factor model typically has not explained a very large proportion (not more than 20%) 
of the item variance. Furthermore, although one would not desire a strong average interitem 
correlation, the extremely low interitem correlations found in the present study are 
problematic for a single-factor model. Furthermore, if there were a single operative factor, 
one might expect with practice to see that factor's importance wax as the noise-producing 
incidental influences waned. This would be reflected in an increase in the model's 
explanatory power. With practice, however, we found that the fit of the single factor did 
not substantially improve. 

These results — the low test-retest reliability of the items, the low interitem correlations, 
and the modest explanatory power of the single-factor model — all emphasize a weak 
reliability of the test at the level of the items, which is in sharp contrast with the APM's 
very satisfactory reliability at the level of the total score. Our emphasis on the weak 
reliability at the item level could appear to be a rather trivial observation if it were not for 
the fact that some of the most recent cognitive analyses of the APM use elements defined at 
the item level as a basis for theoretical models of individual differences in performance. In 
Carpenter et al. (1990), for example, assumptions are made regarding the types of rules and 
the number of instances of such rules that need to be handled in working memory to arrive 
at the correct answer. In this case, item difficulty is modeled as a product of (1) the types of 
rules involved, hypothetically depending on induction processes and (2) the number of 
instances of rules involved, presumably taxing working memory and goal management 
abilities. The characterization of APM items in terms of types and number of rules forms the 
starting conditions for the elaboration of Carpenter et al.'s model of individual differences in 
APM performance. In another, more recent study, Verguts et al. (2000) concentrated on rule 
induction processes and argued that generation speed, the speed at which people find rules 
that potentially govern the variations in the matrices, is an important factor in APM 
performance. However, as was the case with Carpenter et al. (1990), their demonstration is 
based on an initial characterization of the items, this time in terms of the difficulty of rule 
finding, as assessed by qualitative appreciations of external observers. If the presence of a 
single, weak factor at the individual difference level, the low test-retest reliability of the 
items, and the low interitem correlations found in the present study are to be taken seriously, 
the foundations of both these examples are likely to be viewed as unstable. Furthermore, if 
some items share characteristics, such as working memory load or difficulty in rule finding, 
that determine APM performance, why do they not form clusters of items that would be 
identified by factor analysis? 

The problem, of course, is not one of using item characteristics to derive testable hypotheses 
concerning mental functioning. The problem arises when we elaborate a model based on group 
aggregate results and then draw conclusions at the level of individual differences. This is well 
illustrated by results in the present study: Although the improvement in total score was 
paralleled by reduction in the number of items left unanswered at the aggregate level, there was 
no correlation between amount of improvement and the amount of reduction in the number of 
items left unanswered, at the level of individual differences. Returning to the studies of 
Carpenter et al. (1990) and of Verguts et al. (2000), their theoretical reliance on aggregate 
characteristics of the items makes one wonder if some other factor or factors would not be as 
satisfactory an explanation of the individual differences in performance. 

In conclusion, a practice effect was found across three administrations of the APM. 
Improvements in total score across the occasions happened within a context of subjects 
changing both correct and incorrect responses. This response fluctuation, along with the very 
low item reliabilities, suggests that the improvements in performance were not based on the 
acquisition or on the retention of item-specific information, but rather on the development or 
refinement of some process or activity more general in nature. Such general processes or 
activities could of course be at the level of general test-taking strategies or they could be at a 
level more specific to the types of items found in the APM. Further elucidation of what is 
actually being acquired or honed with practice on the APM would be a considerable step 
toward our understanding of the test and, perhaps, in our understanding of what is responsible 
for individual differences in performance. 
Appendix B. APM item correlation (Occasion 2)
Appendix C. APM item correlation (Occasion 3)